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Abstract 

We point out some pitfalls related to the concept of an oracle property as used in Fan 
and Li (2001, 2002, 2004) which are reminiscent of the well-known pitfalls related to Hodges' 
estimator. The oracle property is often a consequence of sparsity of an estimator. We show 
that any estimator satisfying a sparsity property has maximal risk that converges to the 
supremum of the loss function; in particular, the maximal risk diverges to infinity whenever 
the loss function is unbounded. For ease of presentation the result is set in the framework of 
a linear regression model, but generalizes far beyond that setting. In a Monte Carlo study 
we also assess the extent of the problem in finite samples for the smoothly clipped absolute 
deviation (SCAD) estimator introduced in Fan and Li (2001). We find that this estimator 
can perform rather poorly in finite samples and that its worst-case performance relative to 
maximum likelihood deteriorates with increasing sample size when the estimator is tuned 
to sparsity. 

AMS 2000 Subject Classifications: Primary 62J07, 62C99; secondary 62E20, 62F10, 
62F12 
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1 Introduction 



Recent years have seen an increased interest in penalized least squares and penalized maximum 
likelihood estimation. Examples are the class of Bridge estimators introduced by Frank and 
Friedman (1993), which includes Lasso-type estimators as a special case (Knight and Fu (2000)), 
or the smoothly clipped absolute deviation (SCAD) estimator introduced in Fan and Li (2001) 
and further discussed in Fan and Li (2002, 2004), Fan and Peng (2004), and Cai, Fan, Li, and 
Zhou (2005). As shown in Fan and Li (2001), the SCAD estimator, with appropriate choice 
of the regularization (tuning) parameter, possesses a sparsity property, i.e., it estimates zero 
components of the true parameter vector exactly as zero with probability approaching one as 
sample size increases while still being consistent for the non-zero components. An immediate 
consequence of this sparsity property of the SCAD estimator is that the asymptotic distribution 
of this estimator remains the same whether or not the correct zero restrictions are imposed in 
the course of the SCAD estimation procedure. [This simple phenomenon is true more generally 
as pointed out, e.g., in Potscher (1991, Lemma 1).] In other words, with appropriate choice of 
the regularization parameter, the asymptotic distribution of the SCAD estimator based on the 
overall model and that of the SCAD estimator derived from the most parsimonious correct model 
coincide. Fan and Li (2001) have dubbed this property the "oracle property" and have advertised 
this property of their estimator. 1 For appropriate choices of the regularization parameter, the 
sparsity and the oracle property are also possessed by several - but not all - members of the 
class of Bridge estimators (Knight and Fu (2000), p. 1361, Zou (2006)). Similarly, suitably 
tuned thresholding procedures give rise to sparse estimators. 2 Finally, we note that traditional 
post-model-selection estimators (e.g., maximum likelihood estimators following model selection) 
based on a consistent model selection procedure (for example, BIC or test procedures with 
suitably chosen critical values) are another class of estimators that exhibit the sparsity and 
oracle property; see Potscher (1991) and Lecb and Potscher (2005) for further discussion. In a 
recent paper, Bunca (2004) uses such procedures in a semiparamctric framework and emphasizes 
the oracle property of the resulting estimator; see also Bunea and McKeague (2005). 

At first sight, the oracle property appears to be a desirable property of an estimator as it 
seems to guarantee that, without knowing which components of the true parameter are zero, 
we can do (asymptotically) as well as if we knew the correct zero restrictions; that is, we can 
"adapt" to the unknown zero restrictions without paying a price. This is too good to be true, and 

1 The oracle property in the sense of Fan and Li should not be confused with the notion of an oracle inequality 

as frequently used elsewhere in the literature. 

2 Thesc estimators do not satisfy the oracle property in case of non-orthogonal design. 
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it is reminiscent of the "superefficiency" property of the Hodges' estimator; and justly so, since 
Hodges' estimator in its simplest form is a hard-thresholding estimator exhibiting the sparsity 
and oracle property. [Recall that in its simplest form Hodges' estimator for the mean of an 
N(n, l)-distribution is given by the arithmetic mean y of the random sample of size n if \y\ 
exceeds the threshold n -1 / 4 , and is given by zero otherwise.] Now, as is well-known, e.g., from 
Hodges' example, the oracle property is an asymptotic feature that holds only pointwise in the 
parameter space and gives a misleading picture of the actual finite-sample performance of the 
estimator. In fact, the finite sample properties of an estimator enjoying the oracle property 
are often markedly different from what the pointwise asymptotic theory predicts; e.g., the finite 
sample distribution can be bimodal regardless of sample size, although the pointwise asymptotic 
distribution is normal. This is again well-known for Hodges' estimator. For a more general 
class of post-model-selection estimators possessing the sparsity and the oracle property this is 
discussed in detail in Leeb and Potscher (2005), where it is, e.g., also shown that the finite 
sample distribution can "escape to infinity" along appropriate local alternatives although the 
pointwise asymptotic distribution is perfectly normal. 3 See also Knight and Fu (2000, Section 3) 
for related results for Bridge estimators. Furthermore, estimators possessing the oracle property 
are certainly not exempt from the Hajek-LeCam local asymptotic minimax theorem, further 
eroding support for the belief that these estimators are as good as the "oracle" itself (i.e., the 
infeasible "estimator" that uses the information which components of the parameter are zero). 

The above discussion shows that the reasoning underlying the oracle property is misguided. 
Even worse, estimators possessing the sparsity property (which often entails the oracle property) 
necessarily have dismal finite sample performance: It is well-known for Hodges' estimator that 
the maximal (scaled) mean squared error grows without bound as sample size increases (e.g., 
Lchmann and Casella (1998), p. 442), whereas the standard maximum likelihood estimator has 
constant finite quadratic risk. In this note we show that a similar unbounded risk result is in fact 
true for any estimator possessing the sparsity property. This means that there is a substantial 
price to be paid for sparsity even though the oracle property (misleadingly) seems to suggest 
otherwise. As discussed in more detail below, the bad risk behavior is a "local" phenomenon 
and furthermore occurs at points in the parameter space that are "sparse" in the sense that 
some of their coordinates are equal to zero. For simplicity of presentation and for reasons 
of comparability with the literature cited earlier, the result will be set in the framework of a 
linear regression model, but inspection of the proof shows that it easily extends far beyond that 

3 That pointwise asymptotics can be misleading in the context of model selection has been noted earlier in 
Hosoya (1984), Shibata (1986a), Potscher (1991), and Kabaila (1995, 2002). 
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framework. For related results in the context of traditional post-model-selection estimators see 
Yang (2005) and Leeb and Potscher (2005, Appendix C); 4 cf. also the discussion on "partially" 
sparse estimators towards the end of Section 2 below. The theoretical results in Section 2 are 
rounded out by a Monte Carlo study in Section 3 that demonstrates the extent of the problem 
in finite samples for the SCAD estimator of Fan and Li (2001). The reasons for concentrating 
on the SCAD estimator in the Monte Carlo study are (i) that the finite-sample risk behavior 
of traditional post-modcl-selection estimators is well-understood (Judge and Bock (1978), Leeb 
and Potscher (2005)) and (ii) that the SCAD estimator - especially when tuned to sparsity - 
has been highly advertised as a superior procedure in Fan and Li (2001) and subsequent papers 
mentioned above. 



2 Bad Risk Behavior of Sparse Estimators 

Consider the linear regression model 

Vt = x' t e + e t (l<t<n) (1) 

where the k x 1 nonstochastic regressors x t satisfy n^ 1 Y^t=i XtX 't ~ > <3 > as n ^ oo and the 
prime denotes transposition. The errors e t are assumed to be independent identically distributed 
with mean zero and finite variance a 2 . Without loss of generality we freeze the variance at 
a 2 = l. 5 Furthermore, we assume that e t has a density / that possesses an absolutely continuous 
derivative df/dx satisfying 

oo 

0< J ((df(x)/dx)/f(x)) 2 f(x)dx<^. 

— oo 

Note that the conditions on / guarantee that the information of / is finite and positive. These 
conditions are obviously satisfied in the special case of normally distributed errors. Let P n fi 
denote the distribution of the sample (j/i, . . . , y n )' and let E n _o denote the corresponding expec- 
tation operator. For 9 <E R fc , let r(9) denote afcxl vector with components Ti{9) where r,(#) = 
if 9i = and r,(#) = 1 if $i ^ 0. An estimator 9 for 9 based on the sample (j/i, . . . , y n )' is said 
to satisfy the sparsity-type condition if for every fleR 4 

Pn,e (r(9) < r(0)) -> 1 (2) 

4 The unboundedness of the maximal (scaled) mean squared error of estimators following BIC-type model 

selection has also been noted in Hosoya (1984), Shibata (1986b), and Foster and George (1994). 

5 If the variance is not frozen at a 2 = 1, the results below obviously continue to hold for each fixed value of a 2 , 

and hence hold a fortiori if the suprcmum in (3)-(4) below is also taken over a 2 . 
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holds for n —* oo, where the inequality sign is to be interpreted componentwise. That is, the 
estimator is guaranteed to find the zero components of 9 with probability approaching one as 
n —* oo. Clearly, any sparse estimator satisfies (2). In particular, the SCAD estimator as 
well as certain members of the class of Bridge estimators satisfy (2) for suitable choices of the 
rcgularization parameter as mentioned earlier. Also, any post-model-selection estimator based on 
a consistent model selection procedure clearly satisfies (2). All these estimators are additionally 
also consistent for 9, and hence in fact satisfy the stronger condition P n j(r(6) = r(9)) — > 1 
for all 9 £ R fe . [Condition (2) by itself is of course also satisfied by nonsensical estimators like 
6 = 0, but is all that is needed to establish the subsequent result.] We now show that any 
estimator satisfying the sparsity-type condition (2) has quite bad finite sample risk properties. 
For purposes of comparison we note that the (scaled) mean squared error of the least squares 
estimator 9ls satisfies 



E, 



n.O 



n(hs ~ 9)'{9 LS - 9) 



— trace 



_1 E 
t=i 



x t x t 



which converges to trace(Q 1 ), and thus remains bounded as sample size increases. 

Theorem 2.1 6 Let 9 be an arbitrary estimator for 9 that satisfies the sparsity-type condition 
(2). Then the maximal (scaled) mean squared error of 9 diverges to infinity as n — > oo, i.e., 



sup E n>g 



n(9-9)'(9-9) -^oo (3) 
for n — > oo. More generally, let I : R k — > M be a nonnegative loss function. Then 



sup E nfi l{n 1/2 {9 - 9)) -> sup l(s) (4) 

8<ER k s£l* 

for n — > oo. In particular, if the loss function I is unbounded then the maximal risk associated 
with I diverges to infinity as n — > oo. 

The theorem says that, whatever the loss function, the maximal risk of a sparse estimator is 
- in large samples - as bad as it possibly can be. 



"Theorem 2.1 and the ensuing discussion continue to apply if the rcgrcssors xt as well as the errors et are 
allowed to depend on sample size n, at least if the errors are normally distributed. The proof is analogous, except 
that one uses direct computation and LeCam's first lemma (cf., e.g., Lemma A.l in Leeb and Potscher (2006)) 
instead of Koul and Wang (1984) to verify contiguity. Also, the results continue to hold if the design matrix 
satisfies S~ J27=i x t x 't ^ Q > for some positive sequence 5 n other than n, provided that the scaling factor 
«V2 is 

replaced by Si/ 2 throughout. 
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Upon choosing l(s) = |sj|, where Sj denotes the i-th coordinate of s, relation (4) shows that 
also the maximal (scaled) absolute bias of each component 0\ diverges to infinity. 

Applying relation (4) to the loss function l*(s) = l(c's) shows that (4) holds mutatis mutandis 
also for estimators c0 of arbitrary linear contrasts c'0. In particular, using quadratic loss l*(s) = 
(c's) 2 , it follows that also the maximal (scaled) mean squared error of the linear contrast c0 
goes to infinity as sample size increases, provided c ^ 0. 

Proof of Theorem 2.1: It suffices to prove (4). 7 Now, with n = —n^ 1 l 2 s, s e R k arbitrary, 
we have 

sup l(u) > sup E nfi l{n 1 ' 2 (0 - 0)) > E nfi J{n 1 / 2 {0 - 6 n )) 

> E rh9 Jl(n 1/2 (0 - 6 n ))l(9 = 0)] = l{-n^ 2 n )P n fin (r(0) = 0) 

= l{ a )P n ,e n (r{0) = O). (5) 

By the sparsity-type condition we have that P n ,o{ r (0) = 0) — > 1 as n — > oo. Since the model 
is locally asymptotically normal under our assumptions (Koul and Wang (1984), Theorem 2.1 
and Remark 1; Hajek and Sidak (1967), p. 213), the sequence of probability measures P n ,e n is 
contiguous w.r.t. the sequence P n ,o- Consequently, the far r.h.s. of (5) converges to l(s). Since 
s e R k was arbitrary, the proof is complete. ■ 

Inspection of the proof shows that Theorem 2.1 remains true if the supremum of the risk in 
(4) is taken only over open balls of radius p n centered at the origin as long as n 1 ^ 2 p n — ► oo. 
Hence, the bad risk behavior is a local phenomenon that occurs in a part of the parameter space 
where one perhaps would have expected the largest gain over the least squares estimator due 
to the sparsity property. [If the supremum of the risk in (4) is taken over the open balls of 
radius n~ x / 2 p centered at the origin where p > is now fixed, then the proof still shows that the 
limit inferior of this supremum is not less than sup|| s || <p Z(s).] Furthermore, for quadratic loss 
l(s) = s's, a small variation of the proof shows that these "local" results continue to hold if the 
open balls over which the supremum is taken are not centered at the origin, but at an arbitrary 
9, as long as possesses at least one zero component. [It is easy to see that this is more generally 
true for any nonnegative loss function I satisfying, e.g., l(s) > l(iTi(s)) for every s € R k and an 
index i with 0, = 0, where ni represents the projection on the i-th coordinate axis.] 

Inspection of the proof also shows that - at least in the case of quadratic loss - the element 
s can be chosen to point in the direction of a standard basis vector. This then shows that the 
bad risk behavior occurs at parameter values that themselves are "sparse" in the sense of having 

7 Note that the expectations in (3) and (4) are always well-defined. 
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many zero coordinates. 

If the quadratic loss n{9 — 9)'{9 — 9) in (3) is replaced by the weighted quadratic loss 
(9 — 9)' Y^t=i x t x 't(0 ~ 0), then the corresponding maximal risk again diverges to infinity. More 
generally, let l n be a nonnegative loss function that may depend on sample size. Inspection of 
the proof of Theorem 2.1 shows that 

limsup sup l n (u) > limsup sup E ni el n (n 1 ^ 2 (6 — 9)) > sup limsupZ„(«), (6) 

n^oo MgRfc n— >oo ||g || <„- i/2 p ||"||<p n^oo 

liminf sup l n (u) > liminf sup E n el n {n 1 ^ 2 — 9)) > sup liminf Z„(u) (7) 

™^°° ueR k i|6||<n- 1 /2 p ||«||<p ™^°° 

hold for any < p < oo. [In case < p < oo, the lower bounds in (6)-(7) can even be improved to 
limsupjj^^ sup|| M || <p l n (u) and liminfn^oo sup|| M || <)3 l n (u), respectively. 8 It then follows that in 
case p = oo the lower bounds in (6)-(7) can be improved to sup 0<T<oo limsup,^^ sup|| M || <T l n (u) 
and sup 0<r<oo liminf „^oo sup|| u || <r l n (u), respectively] 

Next we briefly discuss the case where an estimator 9 only has a "partial" sparsity property 
(and consequently a commensurable oracle property) in the following sense: Suppose the param- 
eter vector 9 is partitioned as 9 — (a', /?')' and the estimator 9 — (a',/3)' only finds the true 
zero components in the subvector (3 with probability converging to one. E.g., 9 is a traditional 
post-model-selection estimator based on a consistent model selection procedure that is designed 
to only identify the zero components in j3. A minor variation of the proof of Theorem 2.1 im- 
mediately shows again that the maximal (scaled) mean squared error of $, and hence also of 6, 
diverges to infinity, and the same is true for linear combinations d as long as d ^ 0. [This 
immediately extends to linear combinations c 9, as long as c charges at least one coordinate of $ 
with a nonzero coefficient.] 9 However, if the parameter of interest is a rather than /?, Theorem 
2.1 and its proof (or simple variations thereof) do not apply to the mean squared error of a (or 
its linear contrasts). Nevertheless, the maximal (scaled) mean squared error of a can again be 
shown to diverge to infinity, at least for traditional post-model-selection estimators 9 based on a 
consistent model selection procedure; see Leeb and Potscher (2005, Appendix C). 

While the above results are set in the framework of a linear regression model with non- 
stochastic regressors, it is obvious from the proof that they extend to much more general models 
such as regression models with stochastic regressors, scmiparamctric models, nonlinear models, 

8 Note that the local asymptotic normality condition in Koul and Wang (1984) as well as the result in Lemma 
A.l in Leeb and Potscher (2006) imply contiguity of P n ,8 n an d Pn,0 not only for 6 n = 'y/n 1 / 2 but more generally 
for d n = 7 n /n 1//2 with 7 n a bounded sequence. 

9 In fact, this variation of the proof of Theorem 2.1 shows that the suprcmum of E ni el(n 1 / 2 - /3)), where I 
is an arbitrary nonegative loss function, again converges to the supremum of the loss function. 
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time series models, etc., as long as the contiguity property used in the proof is satisfied. This 
is in particular the case whenever the model is locally asymptotically normal, which in turn is 
typically the case under standard regularity conditions for maximum likelihood estimation. 

3 Numerical Results on the Finite Sample Performance of 
the SCAD Estimator 

We replicate and extend Monte Carlo simulations of the performance of the SCAD estimator 
given in Example 4.1 of Fan and Li (2001); we demonstrate that this estimator, when tuned to 
enjoy a sparsity property and an oracle property, can perform quite unfavorably in finite samples. 
Even when not tuned to sparsity, we show that the SCAD estimator can perform worse than the 
least squares estimator in parts of the parameter space, something that is not brought out in the 
simulation study in Fan and Li (2001) as they conducted their simulation only at a single point 
in the parameter space (which happens to be favorable to their estimator). 

Consider n independent observations from the linear model (1) with k = 8 regressors, where 
the errors e t are standard normal and are distributed independently of the regressors. The 
regressors x t are assumed to be multivariate normal with mean zero. The variance of each 
component of x t is equal to 1 and the correlation between the i-th and the j-th component of 
x t , i.e., x t ,i and x t j, is p^^ with p = 0.5. Fan and Li (2001) consider this model with n = 40, 
n = 60, and with the true parameter equal to 8q — (3,1.5,0,0,2,0,0,0)'; cf. also Tibshirani 
(1996, Section 7.2). We consider a whole range of true values for 9 at various sample sizes, 
namely 6 n = #o + (j/\/n) x rj for some vector r\ and for a range of 7's as described below. We 
do this because (i) considering only one choice for the true parameter in a simulation may give 
a wrong impression of the actual performance of the estimators considered, and (ii) because our 
results in Section 2 suggest that the risk of sparse estimators can be large for parameter vectors 
which have some of its components close to, but different from, zero. 

The SCAD estimator is defined as a solution to the problem of minimizing the penalized least 
squares objective function 

n k 
t=l i=l 

where the penalty function p\ is defined in Fan and Li (2001) and A > is a tuning parameter. 
The penalty function p\ contains also another tuning parameter a, which is set equal to 3.7 here, 
resulting in a particular instance of the SCAD estimator which is denoted by SCAD2 in Example 
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4.1 of Fan and Li (2001). According to Theorem 2 in Fan and Li (2001) the SCAD estimator is 
guaranteed to satisfy the sparsity property if A — > and \fnX — > oo as samples size n goes to 
infinity. 

Using the MATLAB code provided to us by Runze Li, we have implemented the SCAD2 
estimator in R. [The code is available from the first author on request.] Two types of performance 
measures are considered: The 'median relative model error' studied by Fan and Li (2001), and 
the relative mean squared error. The median relative model error is defined as follows: For an 
estimator for 9, define the model error ME (6) by ME {9) = {9-9)'Y,(9-9), where E denotes the 
variance/covariance matrix of the regressors. Now define the relative model error of 9 (relative to 
least squares) by ME(0)/ME(9ls), with 9ls denoting the least squares estimator based on the 
overall model. The median relative model error is then given by the median of the relative model 
error. The relative mean squared error of 9 is given by E[{9 — 9)' (9 — 9)]/E[(9ls — 6)'(6ls ~ 0)]- 10 
Note that we have scaled the performance measures such that both of them are identical to unity 
for 9 = 9 LS . 

Setup I: For SCAD2 the tuning parameter A is chosen by generalized cross-validation (cf. 
Section 4.2 of Fan and Li (2001)). In the original study in Fan and Li (2001), the range of A's 
considered for generalized cross-validation at sample sizes n = 40 and n = 60 is {5 (a / \/n) : 
S = 0.9, 1.1, 1.3, . . . , 2}; here, a 2 denotes the usual unbiased variance estimator obtained from a 
least-squares fit of the overall model. For the simulations under Setup I, we re-scale this range 
of A's by log nj log 60. With this, our results for 7 = replicate those in Fan and Li (2001) 
for n = 60; for the other (larger) sample sizes that we consider, the re-scaling guarantees that 
A — > and \/nX — > 00 and hence, in view of Theorem 2 in Fan and Li (2001), guarantees that 
the resulting estimator enjoys the sparsity condition. [For another choice of A sec Setup VI.] 
We compute Monte Carlo estimates for both the median relative model error and the relative 
mean squared error of the SCAD2 estimator for a range of true parameter values, namely 9 n = 
#0 + i'y/y/n) x (0,0,1,1,0,1,1,1)' for 101 equidistant values of 7 between and 8, and for 
sample sizes n = 60, 120, 240, 480, and 960, each based on 500 Monte Carlo replications (for 
comparison, Fan and Li (2001) use 100 replications). Note that the performance measures are 
symmetric about 7 = 0, and hence are only reported for nonnegative values of 7. The results are 
summarized in Figure 1 below. [For better readability, points in Figure 1 are joined by lines.] 



"The mean squared error of B^g is given by _Btrace((X'X)- 1 ) which equals trace(E- 1 )/(n-9) = 38/(3n-27) 
by von Rosen (1988, Theorem 3.1). 
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Median Relative Model Error of SCAD2 Relative Mean Squared Error of SCAD2 
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Figure 1: Monte Carlo performance estimates under the true parameter 8 n = 9o + (.l/y/n) x 
(0, 0, 1, 1, 0, 1, 1, 1)' , as a function of 7. The left panel gives the estimated median relative 
model error of SCAD2 for sample sizes n = 60, 120, 240, 480, 960. The right panel gives 
the corresponding results for the estimated relative mean squared error of SCAD2. Larger 
sample sizes correspond to larger maximal errors. For comparison, the gray line at one 
indicates the performance of the ordinary least squares estimator. 

In the Monte Carlo study of Fan and Li (2001), only the parameter value 8 n is considered. 
This corresponds to the point 7 = in the panels of Figure 1. At that particular point in the 
parameter space, SCAD2 compares quite favorably with least squares. However, Figure 1 shows 
that there is a large range of parameters where the situation is reversed. In particular, we see that 
SCAD2 can perform quite unfavorably when compared to least squares if the true parameter, 
i.e., 6 n , is such that some of its components are close to, but different from, zero. In line with 
Theorem 2.1, we also see that the worst-case performance of SCAD2 deteriorates with increasing 
sample size: For n = 60, ordinary least squares beats SCAD2 in terms of worst-case performance 
by a factor of about 2 in both panels of Figure 1; for n = 960, this factor has increased to about 
3; and increasing n further makes this phenomenon even more pronounced. We also see that, for 
increasing n, the location of the peak moves to the right in Figure 1, suggesting that the worst- 
case performance of SCAD2 (among parameters of the form 6 n = ( / y/\/n) x (0, 0, 1, 1, 0, 1, 1, 1)') 
is attained at a value 7„, which is such that j n — > 00 with n. In view of the proof of Theorem 2.1, 
this is no surprise. 11 [Of course, there may be other parameters at any given sample size for 

11 See Section 2.1 and Footnote 14 in Leeb and Potscher (2005) for related discussion. 
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which SCAD2 performs even worse.] Our simulations thus demonstrate: If each component of 
the true parameter is either very close to zero or quite large (where the components' size has 
to be measured relative to sample size), then the SCAD estimator performs well. However, if 
some component is in-between these two extremes, then the SCAD estimator performs poorly. 
In particular, the estimator can perform poorly precisely in the important situation where it is 
statistically difficult to decide whether some component of the true parameter is zero or not. Poor 
performance is obtained in the worst case over a neighborhood of one of the lower-dimensional 
models, where the 'diameter' of the neighborhood goes to zero slower than 1/y/n. 

We have also re-run our simulations for other experimental setups; the details are given below. 
Since our findings for these other setups are essentially similar to those summarized in Figure 1, 
we first give a brief overview of the other setups and summarize the results before proceeding 
to the details. In Setups II and III we consider slices of the 8-dimcnsional performance measure 
surfaces corresponding to directions other than the one used in Setup I: In Setup II the true 
parameter is of the form 8q + (7/ ^Jn) x (0, 0, 1, 1, 0, 0, 0, 0)', i.e., we consider the case where some 
components are exactly zero, some are large, and others are in-between. In Setup III, we consider 
a scenario in-between Setup I and Setup II, namely the case where the true parameter is of the 
form #o + (7 / 'Vn) x (0, 0, 1, 1, 0, 1/10, 1/10, 1/10)'. The method for choosing A in these two setups 
is the same as in Setup I. The results in these additional setups are qualitatively similar to those 
shown in Figure 1 but slightly less pronounced. In further setups we also consider various other 
rates for the SCAD tuning parameter A. By Theorem 2 of Fan and Li (2001), the SCAD estimator 
is sparse if A — > and y/nX — > 00; as noted before, for Figure 1, A is chosen by generalized cross- 
validation from the set A n = {(5(ir/ v / n)(log(n)/log(60)) : S — 0.9, 1.1, 1.3, . . . , 2}; i.e., we have 
\fn\ = O p (log(ra)). The magnitude of A has a strong impact on the performance of the estimator. 
Smaller values result in 'less sparse' estimates, leading to less favorable performance relative to 
least squares at 7 — 0, but at the same time leading to less unfavorable worst-case performance; 
the resulting performance curves are 'flatter' than those in Figure 1. Larger values of A result 
in 'more sparse' estimates, improved performance at 7 = 0, and more unfavorable worst-case 
performance; this leads to performance curves that are 'more spiked' than those in Figure 1. In 
Setups IV and V we have re-run our simulations with 7 chosen from a set A„ as above, but with 
log(n)/ log(60) replaced by (n/60) 1 / 10 as well as by (n/60) 1 / 4 , resulting in ^/nX = O p (n 1 / 10 ) and 
\/n\ = O p (n 1 / 4 ), respectively. In Setup IV, where ^/nX = O p (n 1 / 10 ), we get results similar to, 
but less pronounced than, Figure 1; this is because Setup IV leads to A's smaller than in Setup I. 
In Setup V, where ^/nX = O p (n 1 / 4 ), we get similar but more pronounced results when compared 
to Figure 1; again, this is so because Setup V leads to larger A's than Setup I. A final setup 
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(Setup VI) in which we do not enforce the conditions for sparsity is discussed below after the 
details for Setups II- V are presented. 

Setups II and III: In Setup II, we perform the same Monte Carlo study as in Setup I, the 
only difference being that the range of 0's is now 8 n = 80 + (j/y/n) x (0, 0, 1, 1, 0, 0, 0, 0)' for 101 
equidistant values of 7 between and 8. The worst-case behavior in this setup is qualitatively 
similar to the one in Setup I but slightly less pronounced; we do not report the results here for 
brevity. In Setup III, we again perform the same Monte Carlo study as in Setup I, but now with 
n =80 + {l/Vn) x (0,0,1, 1,0, 1/10,1/10,1/10)' for 101 equidistant values of 7 between and 
80. Note that here we consider a range for 7 wider than that in Scenario I and II, where we had 
< 7 < 8. Figure 2 gives the results for Setup III. 
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Figure 2: Monte Carlo performance estimates under the true parameter 8 n = #0 + {j/^/nj x 
(0, 0, 1, 1, 0, 1/10, 1/10, 1/10)', as a function of 7. See the legend of Figure 1 for a description 
of the graphics. 

The same considerations as given for Figure 1 also apply to Figure 2. The new feature in 
Figure 2 is that the curves are bimodal. Apparently, this is because now there are two regions, 
in the range of 7's under consideration, for which some components of the underlying regression 
parameter n are neither very close to zero nor quite large (relative to sample size): Components 
3 and 4 for 7 around 5 (first peak), and components 6, 7, and 8 for 7 around 40 (second peak). 

Setups IV and V: Here we perform the same simulations as in Setup I, but now with the 
range of A's considered for generalized cross-validation given by {<5(cr/v / n)(n/60) 1 / 10 : $ = 
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0.9, 1.1, 1.3, . . . , 2} for Setup IV, and by {6(a/ ^)(n/60)^ 4 : 5 = 0.9, 1.1, 1.3, . . . , 2} for Setup V. 
Setup IV gives 'less sparse' estimates while Setup V gives 'more sparse' estimates relative to 
Setup I. The results are summarized in Figures 3 and 4 below. Choosing the SCAD tuning- 
parameter A so that the estimator is 'more sparse' clearly has a detrimental effect on the esti- 
mator's worst-case performance. 
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Figure 3: Monte Carlo performance estimates under the true parameter 8 n = 0o + (■y/^/n) x 
(0, 0, 1, 1, 0, 1, 1, 1)' as a function of 7; the SCAD tuning parameter A is chosen as described 
in Setup IV. 
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Figure 4: Monte Carlo performance estimates under the true parameter 6 n = 6* + (7/Vn) x 
(0, 0, 1, 1, 0, 1, 1, 1)', as a function of 7; the SCAD tuning parameter A is chosen as described 
in Setup V. 

In all setups considered so far we have enforced the conditions A — > and y/n\ — > 00 to 
guarantee sparsity of the resulting SCAD estimator as risk properties of sparse estimators are 
the topic of the paper. In response to a referee we further consider Setup VI which is identical 
to Setup I, except that the range of A's over which generalized cross-validation is effected is 
given by {8(a/y/n) : S = 0.9, 1.1, 1.3, . . . , 2}, which is precisely the range considered in Fan and 
Li (2001). Note that the resulting A does now not satisfy the conditions for sparsity given in 
Theorem 2 of Fan and Li (2001). The results are shown in Figure 5 below. The findings are 
similar to the results from Setup I, in that SCAD2 gains over the least squares estimator in a 
neighborhood of 9o, but is worse by approximately a factor of 2 over considerable portions of the 
range of 7, showing once more that the simulation study in Fan and Li (2001) does not tell the 
entire truth. What is, however, different here from the results obtained under Setup I is that - 
not surprisingly at all - the worst case behavior now does not get worse with increasing sample 
size. [This is akin to the boundedncss of the worst case risk of a post-model-selection estimator 
based on a conservative model selection procedure like AIC or pre-testing with a sample-size 
independent critical value.] 
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Figure 5: Monte Carlo performance estimates under the true parameter 6 n = Oo + (j/\/n) x 
(0, 0, 1, 1, 0, 1, 1, 1)', as a function of 7; the SCAD tuning parameter A is chosen as described 
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in Setup VI. 



4 Conclusion 

We have shown that sparsity of an estimator leads to undesirable risk properties of that estimator. 
The result is set in a linear model framework, but easily extends to much more general parametric 
and semiparametric models, including time series models. Sparsity is often connected to a 
so-called "oracle property". We point out that this latter property is highly misleading and 
should not be relied on when judging performance of an estimator. Both observations are not 
really new, but worth recalling: Hodges' construction of an estimator exhibiting a deceiving 
pointwise asymptotic behavior (i.e., the oracle property in today's parlance) has led mathematical 
statisticians to realize the importance uniformity has to play in asymptotic statistical results. 
It is thus remarkable that today - more than 50 years later - we observe a return of Hodges' 
estimator in the guise of newly proposed estimators (i.e., sparse estimators). What is even more 
surprising is that the deceiving pointwise asymptotic properties of these estimators (i.e., the 
oracle property) are now advertised as virtues of these methods. It is therefore perhaps fitting 
to repeat Hajek's (1971, p. 153) warning: 

"Especially misinformative can be those limit results that are not uniform. Then the limit 
may exhibit some features that are not even approximately true for any finite n." 

The discussion in the present paper as well as in Leeb and Potscher (2005) shows in particular that 
distributional or risk behavior of consistent post-model-selection estimators is not as sometimes 
believed, but is much worse. 

The results of this paper should not be construed as a criticism of shrinkage-type estimators 
including penalized least squares (maximum likelihood) estimators per se. Especially if the 
dimension of the model is large relative to sample size, some sort of shrinkage will typically be 
beneficial. However, achieving this shrinkage through sparsity is perhaps not such a good idea 
(at least when estimator risk is of concern). It certainly cannot simply be justified through an 
appeal to the oracle property. 12 

12 In this context we note that "superefficiency" per se is not necessarily detrimental in higher dimensions 
as witnessed by the Stein phenomenon. However, not all forms of "superefficiency" arc created equal, and 
"superefficiency" generated through sparsity of an estimator typically belongs to the undesirable variety as shown 
in the paper. 
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